About the project

  1. Tools and methods for open and reproducible research: R, RStudio, R Markdown, GitHub
  2. Regression and model validation
  3. Logistic regression
  4. Clustering and classification: discriminant analysis (DA), K-means clustering (KMC)
  5. Dimensionality reduction techniques: principal component analysis (PCA), multiple correspondence analysis (MCA)

https://github.com/nimirum/IODS-project


Week 2: Regression and Model Validation

This week’s topic was linear regression and its extension, multiple linear regression.

I installed ggplot2 and GGally for data analysis and read the learning2014 data into R.

library(ggplot2)
library(GGally)
lrn14 <- read.csv('data/learning2014.txt',sep=',')

The data comes from the course “Johdatus yhteiskuntatilastotieteeseen, syksy 2014” (Introduction to Social Statistics, fall 2014), a study about learning and teaching statistics. It has 7 variables:

  1. Age (in years), derived from the date of birth
  2. Gender: M (male) or F (female)
  3. Attitude: global attitude towards statistics
  4. Deep: a combined score of deep-learning questions
  5. Stra: a combined score of strategic-learning questions
  6. Surf: a combined score of surface-learning questions
  7. Points: total exam points

The structure of the data and its first 5 rows:

dim(lrn14)
## [1] 166   7
str(lrn14)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
head(lrn14, n=5)
##   gender age attitude     deep  stra     surf points
## 1      F  53      3.7 3.583333 3.375 2.583333     25
## 2      M  55      3.1 2.916667 2.750 3.166667     12
## 3      F  49      2.5 3.500000 3.625 2.250000     24
## 4      M  53      3.5 3.500000 3.125 2.250000     10
## 5      M  49      3.7 3.666667 3.625 2.833333     22

A graphical overview of the data

ggpairs(lrn14, mapping = aes(col=gender, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))

There seems to be a correlation between points and attitude. The stra and deep values seem roughly normally distributed. The survey participants are mostly close to age 20, and the total points are fairly evenly distributed.

I chose three variables (attitude, stra, surf) as explanatory variables and fit a regression model with exam points as the target (dependent) variable.

model <- lm(points ~ attitude + stra + surf, data = lrn14)
summary(model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

Only attitude shows a statistically significant (***) relationship with points, so I removed stra and surf.

model <- lm(points ~ attitude, data = lrn14)
summary(model)
## 
## Call:
## lm(formula = points ~ attitude, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

The multiple R-squared of the model is approximately 0.19. Since the best possible value is 1, the model explains about 19% of the variance in exam points.
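As a sanity check, R-squared can be computed by hand from the model's residuals and fitted values. A small sketch on synthetic data (the variables here are made up for illustration, not the learning2014 data):

```r
# Synthetic data: y depends linearly on x plus noise
set.seed(42)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
fit <- lm(y ~ x)

# R-squared = 1 - SSE/SST, i.e. the share of variance in y explained by the model
sse <- sum(residuals(fit)^2)
sst <- sum((y - mean(y))^2)
r2_manual <- 1 - sse / sst
```

For a model with an intercept this equals the squared correlation between the fitted and observed values, which is why R-squared is read as "share of variance explained".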

Diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage.

par(mfrow = c(2,2))
plot(model, which = c(1,2,5))

Of the diagnostic plots, the Normal QQ plot fits the data best: the residuals follow the line fairly well, suggesting they are approximately normally distributed.


Week 3: Logistic regression

This week’s topic was logistic regression.

The data

I read the student performance and alcohol consumption data into R.

library(tidyr); library(dplyr); library(ggplot2); library(boot)
alc <- read.csv('data/students_alc2014.csv',sep=',')

The data set is from the UCI Machine Learning Repository (Student Performance Data Set) and contains students’ performance in mathematics and Portuguese, including alcohol consumption. The alc data is a data frame joined from the mathematics and Portuguese data. The data set has been modified so that ‘alc_use’ is the average of ‘Dalc’ and ‘Walc’, and ‘high_use’ is TRUE if ‘alc_use’ is higher than 2 and FALSE otherwise.

A glimpse of the data and its first 5 rows.

glimpse(alc)
## Observations: 382
## Variables: 35
## $ school     <fctr> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP...
## $ sex        <fctr> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F,...
## $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15,...
## $ address    <fctr> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U,...
## $ famsize    <fctr> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, ...
## $ Pstatus    <fctr> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T,...
## $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, ...
## $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, ...
## $ Mjob       <fctr> at_home, at_home, at_home, health, other, services...
## $ Fjob       <fctr> teacher, other, other, services, other, other, oth...
## $ reason     <fctr> course, course, other, home, home, reputation, hom...
## $ nursery    <fctr> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, y...
## $ internet   <fctr> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes...
## $ guardian   <fctr> mother, father, mother, mother, father, mother, mo...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
## $ studytime  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
## $ failures   <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ schoolsup  <fctr> yes, no, yes, no, no, no, no, yes, no, no, no, no,...
## $ famsup     <fctr> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes...
## $ paid       <fctr> no, no, yes, yes, yes, yes, no, no, yes, yes, yes,...
## $ activities <fctr> no, no, no, yes, no, yes, no, no, no, yes, no, yes...
## $ higher     <fctr> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, ...
## $ romantic   <fctr> no, no, no, yes, no, no, no, no, no, no, no, no, n...
## $ famrel     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, ...
## $ freetime   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, ...
## $ goout      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
## $ Dalc       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Walc       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
## $ health     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
## $ absences   <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, ...
## $ G1         <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11,...
## $ G2         <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11...
## $ G3         <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 1...
## $ alc_use    <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1...
## $ high_use   <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL...
head(alc, n=5)
##   school sex age address famsize Pstatus Medu Fedu    Mjob     Fjob reason
## 1     GP   F  18       U     GT3       A    4    4 at_home  teacher course
## 2     GP   F  17       U     GT3       T    1    1 at_home    other course
## 3     GP   F  15       U     LE3       T    1    1 at_home    other  other
## 4     GP   F  15       U     GT3       T    4    2  health services   home
## 5     GP   F  16       U     GT3       T    3    3   other    other   home
##   nursery internet guardian traveltime studytime failures schoolsup famsup
## 1     yes       no   mother          2         2        0       yes     no
## 2      no      yes   father          1         2        0        no    yes
## 3     yes      yes   mother          1         2        2       yes     no
## 4     yes      yes   mother          1         3        0        no    yes
## 5     yes       no   father          1         2        0        no    yes
##   paid activities higher romantic famrel freetime goout Dalc Walc health
## 1   no         no    yes       no      4        3     4    1    1      3
## 2   no         no    yes       no      5        3     3    1    1      3
## 3  yes         no    yes       no      4        3     2    2    3      3
## 4  yes        yes    yes      yes      3        2     2    1    1      5
## 5  yes         no    yes       no      4        3     2    1    2      5
##   absences G1 G2 G3 alc_use high_use
## 1        5  2  8  8     1.0    FALSE
## 2        3  7  8  8     1.0    FALSE
## 3        8 10 10 11     2.5     TRUE
## 4        1 14 14 14     1.0    FALSE
## 5        2  8 12 12     1.5    FALSE

Hypothesis testing

I chose 4 variables and present hypotheses on how they relate to high/low alcohol consumption. The variables are failures, sex, activities and absences from school. My hypotheses are below:

H1: High alcohol consumption relates to the number of course failures.

H2: Sex is associated with alcohol consumption.

H3: More time spent on activities implies less alcohol consumption.

H4: High amount of absences relates to high alcohol consumption.

Below is the data structure.

variables <- c("high_use","failures","activities","sex","absences")
alc_hyp_test <- select(alc, one_of(variables))
glimpse(alc_hyp_test)
## Observations: 382
## Variables: 5
## $ high_use   <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL...
## $ failures   <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ activities <fctr> no, no, no, yes, no, yes, no, no, no, yes, no, yes...
## $ sex        <fctr> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F,...
## $ absences   <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, ...
gather(alc_hyp_test) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar(fill="darkslategray4")

g1 <- ggplot(alc, aes(x = failures, fill=high_use))
g1 + geom_bar() + ggtitle("Student failures and alcohol consumption")

g2 <- ggplot(alc, aes(x = sex, fill=high_use))
g2 + geom_bar() + ggtitle("Student sex and alcohol consumption")

g3 <- ggplot(alc, aes(x = activities, fill=high_use))
g3 + geom_bar() + ggtitle("Student activities and alcohol consumption")

g4 <- ggplot(alc, aes(x = high_use, y = absences, fill=high_use))
g4 + geom_boxplot() + ylab("absences") + ggtitle("Student absences by alcohol consumption")

About 1/3 of the students are categorized as high alcohol consumers. From the first chart (g1) it is hard to say whether high alcohol use affects failures or not. The second chart (g2) indicates that males consume more alcohol than females. The third chart (g3) suggests that participating in activities implies slightly less alcohol use. In the last chart (g4), absences appear to be related to high alcohol consumption.
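The visual impressions above can be quantified with conditional proportions: within each group, what share of students are high users? A sketch on made-up counts (these numbers are hypothetical, not taken from the alc data):

```r
# Hypothetical data mimicking the structure of alc: sex and high_use
toy <- data.frame(
  sex      = rep(c("F", "M"), times = c(198, 184)),
  high_use = c(rep(c(TRUE, FALSE), times = c(41, 157)),   # females
               rep(c(TRUE, FALSE), times = c(73, 111)))   # males
)

# Row-wise proportions: share of high users within each sex
prop_by_sex <- prop.table(table(toy$sex, toy$high_use), margin = 1)
```

With margin = 1 each row sums to 1, so the TRUE column directly gives the proportion of high users per sex, which is easier to compare than raw bar heights.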

Logistic regression

model <- glm(high_use ~ failures + sex + absences + activities -1, data = alc_hyp_test, family = "binomial")

summary(model)
## 
## Call:
## glm(formula = high_use ~ failures + sex + absences + activities - 
##     1, family = "binomial", data = alc_hyp_test)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2605  -0.8527  -0.5999   1.0786   2.1020  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## failures       0.43063    0.19104   2.254  0.02419 *  
## sexF          -1.74791    0.24779  -7.054 1.74e-12 ***
## sexM          -0.76106    0.23390  -3.254  0.00114 ** 
## absences       0.09382    0.02295   4.089 4.33e-05 ***
## activitiesyes -0.34489    0.24071  -1.433  0.15191    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 529.56  on 382  degrees of freedom
## Residual deviance: 422.34  on 377  degrees of freedom
## AIC: 432.34
## 
## Number of Fisher Scoring iterations: 4

The students’ high alcohol use seems to relate most strongly to sex and the number of absences, since those coefficients are significant (***). I drop activities and refit the logistic regression without it.

model <- glm(high_use ~ failures + sex + absences, data = alc_hyp_test, family = "binomial")

summary(model)
## 
## Call:
## glm(formula = high_use ~ failures + sex + absences, family = "binomial", 
##     data = alc_hyp_test)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1855  -0.8371  -0.6000   1.1020   2.0209  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.90297    0.22626  -8.411  < 2e-16 ***
## failures     0.45082    0.18992   2.374 0.017611 *  
## sexM         0.94117    0.24200   3.889 0.000101 ***
## absences     0.09322    0.02295   4.063 4.85e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 424.40  on 378  degrees of freedom
## AIC: 432.4
## 
## Number of Fisher Scoring iterations: 4
OR <- coef(model) %>% exp
CI <- confint(model) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
##                    OR      2.5 %   97.5 %
## (Intercept) 0.1491257 0.09395441 0.228611
## failures    1.5695984 1.08339644 2.294737
## sexM        2.5629682 1.60381392 4.149405
## absences    1.0977032 1.05169654 1.150848

Here we can see that high alcohol use is associated with higher odds of course failures (OR ≈ 1.57) and absences (OR ≈ 1.10 per absence), and that high-use students are more likely to be male than female (OR ≈ 2.56). These results support my hypotheses.
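The link between a logistic-regression coefficient and an odds ratio can be checked on a small synthetic example where the true odds ratio is known (all counts below are made up for illustration):

```r
# 2x2 synthetic data: exposed group has odds 20/20 = 1, unexposed 10/30 = 1/3,
# so the true odds ratio is exactly 3
toy <- data.frame(
  exposed = rep(c(0, 1), times = c(40, 40)),
  outcome = c(rep(c(1, 0), times = c(10, 30)),   # unexposed
              rep(c(1, 0), times = c(20, 20)))   # exposed
)

fit <- glm(outcome ~ exposed, data = toy, family = "binomial")
or_exposed <- exp(coef(fit)["exposed"])  # exponentiated coefficient recovers the odds ratio
```

This is exactly what the `coef(model) %>% exp` step above does: each exponentiated coefficient is the multiplicative change in the odds of high_use per unit change in the predictor.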

Prediction

probabilities <- predict(model, type = "response")

alc_hyp_test <- mutate(alc_hyp_test, probability = probabilities)
alc_hyp_test <- mutate(alc_hyp_test, prediction = probabilities > 0.5)
table(high_use = alc_hyp_test$high_use, prediction = probabilities > 0.5)
##         prediction
## high_use FALSE TRUE
##    FALSE   259    9
##    TRUE     84   30
g <- ggplot(alc_hyp_test, aes(x = probability, y = high_use, col= prediction))
g + geom_point()

table(high_use = alc_hyp_test$high_use, prediction = alc_hyp_test$prediction) %>% prop.table()  %>% addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.67801047 0.02356021 0.70157068
##    TRUE  0.21989529 0.07853403 0.29842932
##    Sum   0.89790576 0.10209424 1.00000000
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

loss_func(class = alc_hyp_test$high_use, prob = alc_hyp_test$probability)
## [1] 0.2434555

The model misclassifies about 24% of the observations (training error ≈ 0.243).
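On a small made-up vector the loss function's behaviour is easy to trace by hand: an observation counts as wrong when the predicted probability is on the wrong side of 0.5.

```r
# Same loss function as above: share of observations where |class - prob| > 0.5
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# Four toy observations: the last two are misclassified
class <- c(1, 0, 1, 0)
prob  <- c(0.9, 0.2, 0.3, 0.6)
err <- loss_func(class, prob)  # 2 wrong out of 4 -> 0.5
```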

Cross validation

cv <- cv.glm(data = alc_hyp_test, cost = loss_func, glmfit = model, K = 10)

cv$delta[1]
## [1] 0.2513089

Cross-validation (K = 10) seems to increase the proportion of wrong predictions slightly (about 0.25 versus the 0.24 training error).
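cv.glm hides the mechanics, so here is a minimal hand-rolled K-fold loop showing what it computes: fit on K-1 folds, measure error on the held-out fold, average. This sketch uses synthetic data with made-up variable names, not the alc data.

```r
set.seed(2017)
# Synthetic binary-outcome data
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(1.5 * x))
d <- data.frame(x = x, y = y)

K <- 10
folds <- sample(rep(1:K, length.out = n))  # random fold assignment

errors <- sapply(1:K, function(k) {
  fit <- glm(y ~ x, data = d[folds != k, ], family = "binomial")
  p <- predict(fit, newdata = d[folds == k, ], type = "response")
  mean((p > 0.5) != d$y[folds == k])       # misclassification rate on the held-out fold
})
cv_error <- mean(errors)
```

Because each fold's error is measured on data the model never saw, the CV estimate is usually slightly higher than the training error, as observed above.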


Week 4: Clustering and classification

This week’s topic was clustering and classification.

The data

I load the Boston data from the MASS library into R.

library(dplyr);library(tidyr);library(ggplot2)
library(boot);library(MASS);library(corrplot)
library(plotly)
#library(tidyverse)

#Load the Boston data from the MASS package. 
data("Boston")

The Boston data set contains information about factors affecting housing values in the suburbs of Boston. The variables include, e.g., the per-capita crime rate (crim), the full-value property-tax rate per $10,000 (tax) and the pupil-teacher ratio (ptratio).

#Explore the structure and the dimensions of the data
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

From the summary, we can see that all the variables in this data frame are numeric.

# The correlation matrix 
cor_matrix<-cor(Boston) %>% round(digits = 2)

# Plotting the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos = "b",tl.pos = "d",tl.cex = 0.6)

Above is the correlation chart of the variables. It shows, for example, that rad (index of accessibility to radial highways) correlates strongly and positively with tax (property-tax rate), while lstat (percentage of lower-status population) correlates strongly and negatively with medv (median value of owner-occupied homes in $1000s).

The data must be scaled before any further analysis, so that all variables are on a comparable scale.

boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
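After scaling, each column should have mean 0 and standard deviation 1. A quick check of what scale() does, on a toy matrix:

```r
# Toy matrix with two columns on very different scales
m <- matrix(c(1, 2, 3, 4,
              10, 20, 30, 40), ncol = 2)

# scale() subtracts each column mean and divides by each column sd
m_scaled <- scale(m)
col_means <- colMeans(m_scaled)        # should all be ~0
col_sds   <- apply(m_scaled, 2, sd)    # should all be ~1
```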

The crime rate is categorized into four classes using its quantiles as break points.

# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label=c("low", "med_low","med_high","high"))
# save it
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
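Cutting at the quantiles produces four roughly equal-sized classes, since each quartile interval contains about a quarter of the observations. A sketch on a toy vector:

```r
# Toy vector: 100 evenly spaced values
v <- 1:100

# quantile() returns min, 25%, 50%, 75% and max by default
bins <- quantile(v)

# cut() assigns each value to the quartile interval it falls in
cats <- cut(v, breaks = bins, include.lowest = TRUE,
            labels = c("low", "med_low", "med_high", "high"))
sizes <- table(cats)  # each class should hold ~25 values
```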

Randomly split the data into training (80%) and test (20%) sets.

# choose randomly 80% of the rows
n <- nrow(boston_scaled)
ind <- sample(n,  size = n * 0.8)

# create train and test set
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]

# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)

Linear discriminant analysis, in which I use categorical crime rate as the target variable and all the other variables as predictor variables.

lda.fit <- lda(crime ~., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2500000 0.2475248 0.2623762 0.2400990 
## 
## Group means:
##                  zn      indus        chas        nox          rm
## low       0.9296157 -0.9261779 -0.07742312 -0.8778804  0.42639628
## med_low  -0.1133516 -0.2626364 -0.03610305 -0.5421757 -0.15736079
## med_high -0.3723618  0.2209324  0.13623794  0.4160474  0.07751014
## high     -0.4872402  1.0172187 -0.02879709  1.0413202 -0.44966689
##                 age        dis        rad        tax     ptratio
## low      -0.8654009  0.8775595 -0.6828149 -0.7418498 -0.47383669
## med_low  -0.3147161  0.3561123 -0.5488991 -0.4555504 -0.06907037
## med_high  0.4024629 -0.3903883 -0.4217227 -0.3008765 -0.32283945
## high      0.7992979 -0.8466358  1.6371072  1.5133254  0.77958792
##                black       lstat        medv
## low       0.37699981 -0.76251163  0.52092006
## med_low   0.34748360 -0.10125431 -0.01987651
## med_high  0.08979639  0.04694521  0.18014544
## high     -0.63155403  0.87809704 -0.68475410
## 
## Coefficients of linear discriminants:
##                 LD1          LD2         LD3
## zn       0.09186832  0.702891216 -0.96690457
## indus    0.04436107 -0.286916791  0.29952767
## chas    -0.08731176  0.004350385  0.05963200
## nox      0.34157645 -0.811906408 -1.28178167
## rm      -0.06247102 -0.127700420 -0.11624383
## age      0.26237761 -0.215567366 -0.09579876
## dis     -0.04724658 -0.282516751  0.23690261
## rad      3.17761478  0.970533112 -0.13015647
## tax     -0.01239689 -0.013533005  0.60779828
## ptratio  0.12628938 -0.043204355 -0.26907330
## black   -0.11940031  0.058631472  0.22110326
## lstat    0.23158515 -0.334926101  0.38607805
## medv     0.18376460 -0.515893403 -0.16837620
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9446 0.0421 0.0133

Plotting LDA results with a biplot

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)

Calculating the LDA prediction performance with the test data.

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       18       6        2    0
##   med_low    6      18        2    0
##   med_high   0      10        9    1
##   high       0       0        0   30
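The diagonal of the cross-tabulation holds the correct predictions, so overall accuracy is the diagonal sum divided by the total. Recomputing it from the table above (note that the exact counts vary between runs, since the train/test split is random):

```r
# Confusion matrix copied from the cross-tabulation above
conf <- matrix(c(18,  6, 2,  0,
                  6, 18, 2,  0,
                  0, 10, 9,  1,
                  0,  0, 0, 30),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))

# Accuracy = correctly classified / all test observations
accuracy <- sum(diag(conf)) / sum(conf)
```

Here 75 of the 102 test observations are classified correctly (accuracy ≈ 0.74), with the high class predicted perfectly and most confusion between the middle classes.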

Loading the Boston data again and preparing it for clustering

data('Boston')
Boston <- scale(Boston)
dist_eu <- dist(Boston)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4620  4.8240  4.9110  6.1860 14.4000
n <- nrow(boston_scaled)
ind <- sample(n,  size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
correct_classes <- test$crime
test <- dplyr::select(test, -crime)

Clustering the data with 4 cluster centroids.

km <-kmeans(Boston, centers = 4)
pairs(Boston, col = km$cluster)

Finding the optimal number of clusters by calculating the total within-cluster sum of squares (WCSS) for 1 to 10 clusters.

# calculate WCSS
set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

From the plot above, the optimal number of cluster centroids seems to be 2: the best number of clusters is where the total WCSS drops most sharply.
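The elbow heuristic relies on the total WCSS falling sharply once k matches the true number of clusters. A sketch on synthetic data with two well-separated clusters (the seed and cluster locations are arbitrary choices):

```r
set.seed(123)
# Two well-separated clusters in 2D: one around (0, 0), one around (6, 6)
pts <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 6), ncol = 2))

# Total within-cluster sum of squares for k = 1..6
k_max <- 6
twcss <- sapply(1:k_max, function(k) {
  kmeans(pts, centers = k, nstart = 10)$tot.withinss
})
```

With truly two clusters, the WCSS collapses between k = 1 and k = 2 and flattens afterwards, which is the "elbow" the plot above looks for.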

K-means clustering again, but with 2 cluster centroids.

km <-kmeans(Boston, centers = 2)
pairs(Boston, col = km$cluster)

model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

A 3D plot of the LDA projection, colored by crime category.

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=train$crime)

Week 5: Dimensionality reduction techniques

The data

The data is from the UN Development Programme’s Human Development Index (HDI) and Gender Inequality Index (GII) data sets. These two are combined into a data frame named “human”. More information about the data is available from the UN Development Programme’s HDI pages.

# dependencies
library(dplyr); library(tidyr); library(ggplot2)
library(stringr); library(FactoMineR); library(GGally); library(corrplot)
# download human data
human <- read.csv("data/human.csv")
str(human)
## 'data.frame':    155 obs. of  9 variables:
##  $ X        : Factor w/ 155 levels "Afghanistan",..: 105 6 134 41 101 54 67 149 28 102 ...
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI      : Factor w/ 155 levels "1,123","1,228",..: 132 109 124 112 113 111 103 123 108 93 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
dim(human)
## [1] 155   9
head(human)
##             X   Edu2.FM   Labo.FM Edu.Exp Life.Exp    GNI Mat.Mor
## 1      Norway 1.0072389 0.8908297    17.5     81.6 64,992       4
## 2   Australia 0.9968288 0.8189415    20.2     82.4 42,261       6
## 3 Switzerland 0.9834369 0.8251001    15.8     83.0 56,431       6
## 4     Denmark 0.9886128 0.8840361    18.7     80.2 44,025       5
## 5 Netherlands 0.9690608 0.8286119    17.9     81.6 45,435       6
## 6     Germany 0.9927835 0.8072289    16.5     80.9 43,919       7
##   Ado.Birth Parli.F
## 1       7.8    39.6
## 2      12.1    30.5
## 3       1.9    28.5
## 4       5.1    38.0
## 5       6.2    36.9
## 6       3.8    36.9

The structure of the data shows the 9 variables. X contains the row names, which are the names of the countries.

# visualize human data
summary(human)
##            X          Edu2.FM          Labo.FM          Edu.Exp     
##  Afghanistan:  1   Min.   :0.1717   Min.   :0.1857   Min.   : 5.40  
##  Albania    :  1   1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25  
##  Algeria    :  1   Median :0.9375   Median :0.7535   Median :13.50  
##  Argentina  :  1   Mean   :0.8529   Mean   :0.7074   Mean   :13.18  
##  Armenia    :  1   3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20  
##  Australia  :  1   Max.   :1.4967   Max.   :1.0380   Max.   :20.20  
##  (Other)    :149                                                    
##     Life.Exp          GNI         Mat.Mor         Ado.Birth     
##  Min.   :49.00   1,123  :  1   Min.   :   1.0   Min.   :  0.60  
##  1st Qu.:66.30   1,228  :  1   1st Qu.:  11.5   1st Qu.: 12.65  
##  Median :74.20   1,428  :  1   Median :  49.0   Median : 33.60  
##  Mean   :71.65   1,458  :  1   Mean   : 149.1   Mean   : 47.16  
##  3rd Qu.:77.25   1,507  :  1   3rd Qu.: 190.0   3rd Qu.: 71.95  
##  Max.   :83.50   1,583  :  1   Max.   :1100.0   Max.   :204.80  
##                  (Other):149                                    
##     Parli.F     
##  Min.   : 0.00  
##  1st Qu.:12.40  
##  Median :19.30  
##  Mean   :20.91  
##  3rd Qu.:27.95  
##  Max.   :57.50  
## 
plot(human, col="blue")

The summary shows the distributions of the variables. X is, as described, just the list of country names. Edu2.FM is the ratio of females to males with at least secondary education, and Labo.FM the corresponding ratio of labour-force participation. Life.Exp is the life expectancy at birth, showing roughly how long people live on average in each country, and Edu.Exp is the expected years of schooling.

human$GNI<- str_replace(human$GNI, pattern=",", replace ="") %>% as.numeric()
human_ <- select(human, -X)

cor(human_) %>% corrplot()

Principal component analysis

#PCA on the not standardized human data.
pca_human <- prcomp(human_)
biplot(pca_human, choices = 1:2,cex=c(0.8,1),col = c("grey40", "deeppink2"))

#PCA on standardized/scaled variables
human_std <- scale(human_)
pca_human <- prcomp(human_std)
biplot(pca_human, choices = 1:2,cex=c(0.8,1),col = c("grey40", "deeppink2"))

Above are the biplots of the data frame after PCA. In the first, the data has not been standardized, which is why the plot has a strange shape: PCA maximizes variance, so without standardization the variable with the largest scale (here GNI) dominates the first principal component and the other variables pack tightly together. With the standardized data all variables are on the same scale, and the plot is much easier to read.
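The effect of standardization on PCA can be seen on a toy data set where one variable has a much larger scale than the other (the variable names here are made up):

```r
set.seed(1)
toy <- data.frame(small = rnorm(100),          # sd ~ 1
                  big   = rnorm(100) * 1000)   # sd ~ 1000

# Without scaling, the large-variance variable dominates the first component
pca_raw <- prcomp(toy)
loading_big <- abs(pca_raw$rotation["big", "PC1"])  # close to 1

# After standardization both variables contribute on an equal footing
pca_std <- prcomp(scale(toy))
```

This is exactly why the first biplot above looks degenerate: PC1 essentially reproduces GNI, and the other variables collapse together.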

Multiple Correspondence Analysis

A data set “tea” from the FactoMineR library. The data contains answers to a questionnaire about tea-drinking habits.

data(tea)
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36
library(dplyr)
keep_columns <- c("Tea", "evening", "dinner", "lunch", "where")
tea_time <- select(tea, one_of(keep_columns))

library(tidyr)
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

Multiple Correspondence Analysis with the variables Tea, evening, dinner, lunch and where.

mca <- MCA(tea_time, graph = FALSE)
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.275   0.221   0.209   0.204   0.185   0.157
## % of var.             19.619  15.761  14.926  14.574  13.242  11.226
## Cumulative % of var.  19.619  35.380  50.306  64.880  78.121  89.347
##                        Dim.7
## Variance               0.149
## % of var.             10.653
## Cumulative % of var. 100.000
## 
## Individuals (the 10 first)
##                Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## 1           |  0.108  0.014  0.013 | -0.612  0.566  0.427 | -0.508  0.412
## 2           |  0.108  0.014  0.013 | -0.612  0.566  0.427 | -0.508  0.412
## 3           |  0.514  0.321  0.080 |  0.398  0.239  0.048 |  0.585  0.545
## 4           |  0.864  0.907  0.247 | -0.052  0.004  0.001 |  0.738  0.868
## 5           | -0.323  0.126  0.159 |  0.148  0.033  0.033 | -0.186  0.055
## 6           |  0.864  0.907  0.247 | -0.052  0.004  0.001 |  0.738  0.868
## 7           |  0.028  0.001  0.002 | -0.302  0.138  0.242 | -0.034  0.002
## 8           | -0.242  0.071  0.051 | -0.162  0.040  0.023 | -0.661  0.697
## 9           | -0.176  0.038  0.037 | -0.348  0.183  0.145 |  0.678  0.734
## 10          | -0.446  0.241  0.123 | -0.208  0.065  0.027 |  0.051  0.004
##               cos2  
## 1            0.294 |
## 2            0.294 |
## 3            0.104 |
## 4            0.180 |
## 5            0.053 |
## 6            0.180 |
## 7            0.003 |
## 8            0.378 |
## 9            0.552 |
## 10           0.002 |
## 
## Categories (the 10 first)
##                 Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black       |  -0.037   0.024   0.000  -0.363 |  -0.651   9.487   0.139
## Earl Grey   |  -0.247   2.867   0.110  -5.746 |   0.076   0.339   0.010
## green       |   1.529  18.732   0.289   9.297 |   1.015  10.270   0.127
## evening     |  -0.603   9.089   0.190  -7.539 |   0.694  14.991   0.252
## Not.evening |   0.315   4.752   0.190   7.539 |  -0.363   7.838   0.252
## dinner      |   2.039  21.192   0.313   9.673 |   0.547   1.895   0.022
## Not.dinner  |  -0.153   1.595   0.313  -9.673 |  -0.041   0.143   0.022
## lunch       |  -1.104  13.026   0.210  -7.917 |   1.532  31.188   0.403
## Not.lunch   |   0.190   2.239   0.210   7.917 |  -0.263   5.360   0.403
## chain store |  -0.032   0.047   0.002  -0.730 |  -0.119   0.818   0.025
##              v.test     Dim.3     ctr    cos2  v.test  
## black        -6.445 |  -0.668  10.529   0.146  -6.608 |
## Earl Grey     1.770 |   0.416  10.679   0.313   9.671 |
## green         6.170 |  -0.938   9.265   0.109  -5.703 |
## evening       8.678 |  -0.230   1.731   0.028  -2.870 |
## Not.evening  -8.678 |   0.120   0.905   0.028   2.870 |
## dinner        2.593 |   1.639  18.005   0.202   7.777 |
## Not.dinner   -2.593 |  -0.123   1.355   0.202  -7.777 |
## lunch        10.980 |  -0.045   0.029   0.000  -0.325 |
## Not.lunch   -10.980 |   0.008   0.005   0.000   0.325 |
## chain store  -2.737 |  -0.498  15.165   0.440 -11.472 |
## 
## Categorical variables (eta2)
##               Dim.1 Dim.2 Dim.3  
## Tea         | 0.297 0.222 0.318 |
## evening     | 0.190 0.252 0.028 |
## dinner      | 0.313 0.022 0.202 |
## lunch       | 0.210 0.403 0.000 |
## where       | 0.364 0.204 0.496 |
plot(mca, invisible=c("ind"), habillage = "quali")

Three categories stand out from the crowd: dinner, tea shop and green tea. They lie far from the origin along the dimensions and seem to correlate with each other. All the other categories gather around the origin, except lunch.
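As a sanity check on the eigenvalues printed above: in MCA the total inertia equals (J − Q)/Q, where J is the total number of categories and Q the number of active variables. Here tea_time has Q = 5 variables with J = 3 + 2 + 2 + 2 + 3 = 12 categories, so there are J − Q = 7 dimensions and the eigenvalues should sum to 7/5 = 1.4, which they do. A quick base-R check (values copied from the summary output):

```r
# Eigenvalues (variances) of the 7 MCA dimensions, as printed above
vars <- c(0.275, 0.221, 0.209, 0.204, 0.185, 0.157, 0.149)

J <- 3 + 2 + 2 + 2 + 3  # categories of Tea, evening, dinner, lunch, where
Q <- 5                  # number of active variables

J - Q        # number of dimensions: 7
(J - Q) / Q  # total inertia: 1.4
sum(vars)    # sum of eigenvalues: 1.4 (up to rounding)
```

This also explains why the first dimension captures only about 20% of the variance: in MCA the inertia is spread over many dimensions, so low percentages per dimension are normal.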